Molecular recognition in a lattice model: An enumeration study

We investigate the mechanisms underlying selective molecular recognition of single heteropolymers
at chemically structured planar surfaces. To this end, we study systems with two-letter (HP) lattice
heteropolymers by exact enumeration techniques. Selectivity for a particular surface is defined by
an adsorption energy criterion. We analyze the distributions of selective sequences and the role of
mutations. A particularly important factor for molecular recognition is the small-scale structure on
the polymers.

PACS numbers: 87.15.Aa, 87.14.Ee, 46.65.+g, 68.35.Md 



Selective molecular recognition governs many biological processes such as DNA-protein binding
or cell-mediated recognition. Biotechnological applications range from the development of
biosensoric materials to cell-specific drug targeting. The specificity in these processes results
from the interplay of a few unspecific interactions (van der Waals forces, electrostatic forces,
hydrogen bonds, and the hydrophobic force) and a heterogeneous composition of the polymer chain.
Selectivity is a genuinely cooperative effect. The question of how it emerges in a complex system
is therefore very interesting from the point of view of statistical physics, and the study of
idealized models can provide insight into general principles.

Previous theoretical studies have mostly considered heteropolymer adsorption in either regular
or random systems. The interplay of cluster sizes on random heteropolymers and random surfaces
and its influence on the adsorption thermodynamics and kinetics were studied analytically and
with computer simulations. Concepts from the statistical physics of spin glasses were used to
study the adsorption of polymers on a "native" surface compared with that on an arbitrary
random surface [6, 8, 14].

In the present paper, we focus on a different question: We investigate mechanisms by which
specific heteropolymers distinguish between given surfaces. To this end, we adopt an approach
which has proven highly rewarding in the context of the closely related problem of protein
folding [15, 16, 17, 18]: We enumerate exactly all compact polymer conformations within a
lattice model. The protein is described as a heteropolymer chain consisting of two types of
monomers, hydrophobic (H) and polar (P), each of which occupies one site on the lattice. Sites
surrounding the polymer are assumed to contain solvent. The protein is exposed to an
impenetrable flat surface covered with sites of either type H or type P, which form a
particular surface pattern. It may adsorb there and change its conformation during the
adsorption process. However, we require that both the free and the adsorbed chain are compactly
folded in a given shape (cubic or rectangular). Nearest-neighbor particles interact with fixed,
type-dependent interaction energies. Surface sites H and P are considered to be equivalent to
monomer sites H
and P. The total energy is then given by:

    $$ E = \sum_{\langle i,j \rangle} \sum_{\alpha,\beta} E_{\alpha\beta} \, \tau_i^{\alpha} \tau_j^{\beta} . \qquad (1) $$

Here the sum over $\langle i,j \rangle$ runs over nearest-neighbor pairs, the sums over $\alpha$
and $\beta$ run over the types hydrophobic (H), polar (P), or solvent (S), and $\tau_i^{\alpha}$
is an occupation number which takes the value one if the site $i$ is occupied with type $\alpha$,
and zero otherwise. For compact chains with a fixed sequence, the energy spectrum as defined by
Eq. (1) is (except for a fixed offset) fully characterized by only two parameters: one which
describes the relative incompatibility of H and P inside the globule,
$V = 2E_{HP} - E_{HH} - E_{PP}$, and one which accounts for the difference between the affinities
of H and P to the solvent, $W = 2(E_{HS} - E_{PS}) + (E_{PP} - E_{HH})$. Since one of these
parameters sets the energy scale, the model has only one dimensionless free parameter, $V/W$.
Motivated by earlier work, in which $V/W = 0.13$, we chose $V/W = 0.1$.
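The model described above can be sketched in a few lines of Python for the two-dimensional case. This is a minimal illustration, not the authors' code: the function names, the string keys such as `"HP"` for the interaction energies, and the choice to count only pairs that contain at least one monomer are assumptions made here. The first function enumerates all compact conformations of a rectangular lattice as directed walks, the second evaluates the nearest-neighbor energy of one adsorbed conformation whose bottom row touches the surface.

```python
from itertools import product

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def compact_walks(lx, ly):
    """Yield every directed walk that visits each site of an lx-by-ly lattice once."""
    n_sites = lx * ly

    def extend(path, visited):
        if len(path) == n_sites:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in NEIGHBORS:
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < lx and 0 <= nxt[1] < ly and nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                yield from extend(path, visited)
                visited.remove(nxt)
                path.pop()

    for start in product(range(lx), range(ly)):
        yield from extend([start], {start})


def pair_key(a, b):
    """Order-independent key for an interaction energy, e.g. ('P', 'H') -> 'HP'."""
    return "".join(sorted(a + b))


def total_energy(walk, sequence, surface, energies):
    """Nearest-neighbor energy of an adsorbed compact chain.

    Assumptions of this sketch: only pairs containing at least one monomer are
    counted (all other pairs give a conformation-independent offset), row
    y == 0 touches the surface, and every other outside neighbor is solvent 'S'.
    """
    kind = dict(zip(walk, sequence))
    total = 0.0
    for site, a in kind.items():
        x, y = site
        for dx, dy in NEIGHBORS:
            nb = (x + dx, y + dy)
            if nb in kind:
                if nb > site:  # count each monomer-monomer pair only once
                    total += energies[pair_key(a, kind[nb])]
            else:
                b = surface[x] if nb[1] < 0 else "S"
                total += energies[pair_key(a, b)]
    return total


if __name__ == "__main__":
    print(sum(1 for _ in compact_walks(3, 3)))  # 40 directed walks on the 3x3 lattice
```

Walks are generated in both reading directions because a heteropolymer sequence is in general not symmetric under reversal. The cost of the enumeration grows quickly with lattice size, which is consistent with the small systems considered in the text.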

We consider two-dimensional and three-dimensional systems with sizes up to 6 x 6 (in 2D) and
3 x 3 x 3 (in 3D), respectively. For each system, a set of sequences was picked randomly
(uncorrelated monomers, equal probability for H and P). For each sequence, we then evaluated
the energies for all possible compact chain conformations in contact with all possible surfaces.
This allowed us to determine exactly the ground-state adsorption energy on every surface. We
call a sequence selective if there exists one unique surface with highest adsorption energy,
i.e., if the difference

    $$ E_{\mathrm{gap}} = E_{\mathrm{ad}}^{(1)} - E_{\mathrm{ad}}^{(2)} \qquad (2) $$

between the adsorption energies $E_{\mathrm{ad}}^{(1)}$ and $E_{\mathrm{ad}}^{(2)}$ on the two
most favorable surfaces is nonzero. The lowest-energy structure of the chain on its favorite
surface (the "selected" surface) is not necessarily unique.
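The selectivity criterion can be illustrated with a short sketch. It assumes that the ground-state adsorption energies have already been computed for every surface pattern and that a larger value means more favorable binding, following the wording "highest adsorption energy" in the text. The function names and the numerical tolerance are hypothetical.

```python
def selectivity_gap(adsorption_energy):
    """Return (selected_surface, gap) for a mapping surface pattern -> adsorption energy.

    Assumed sign convention: a larger value means more favorable binding.
    """
    if len(adsorption_energy) < 2:
        raise ValueError("need at least two surfaces to define a gap")
    ranked = sorted(adsorption_energy.items(), key=lambda item: item[1], reverse=True)
    (best_surface, best), (_, runner_up) = ranked[0], ranked[1]
    return best_surface, best - runner_up


def is_selective(adsorption_energy, tolerance=1e-12):
    """A sequence is selective if its most favorable surface is unique (nonzero gap)."""
    return selectivity_gap(adsorption_energy)[1] > tolerance


if __name__ == "__main__":
    energies = {"HHPHH": 3.0, "HPHPH": 2.5, "PPPPP": 1.0}  # made-up numbers
    print(selectivity_gap(energies))  # ('HHPHH', 0.5)
```

A sequence whose two best surfaces are degenerate has a gap of zero and is therefore not counted as selective.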

We note that this selectivity criterion is a "zero-temperature" criterion: Entropic
contributions to the adsorption free energy are not accounted for. Furthermore, we disregard
dynamic and kinetic factors, which presumably also play a role in molecular recognition
processes.

In all systems, more than 90% of all sequences were selective. The distribution over the
different surfaces was highly inhomogeneous, see Fig. 1. A closer inspection reveals that two
main factors contribute to the frequency with which sequences select a particular surface
pattern: A high number of hydrophobic sites inside the pattern is beneficial, whereas
hydrophobic sites at the border are unfavorable. The reason is that bound proteins preferring
the latter surface patterns must have hydrophobic monomers at the edges. The resulting
unfavorable contacts with the solvent have to be compensated to achieve an energetic minimum,
which reduces the number of suitable sequences. The frequency distribution could
be fitted remarkably well by the simple formula
where $n_{\mathrm{core}}$ denotes the number of hydrophobic core sites, $n_{\mathrm{border}}$
the number of hydrophobic border sites, $m$ the total number of core sites, and $A$ and $B$ are
fit parameters. For the 5x5 system, such a fit is illustrated in Fig. 1. The fitting is also
successful for other systems, even for the 3D case, if one identifies sites at the corner of
the surface with border sites. The functional form of Eq. (3) was guessed empirically, with no
underlying theory, and should not be over-interpreted. Nevertheless, we can conclude that the
relative frequency of surface patterns is mostly determined by a few, unspecific surface
characteristics.

The previous analysis raises the question of how sequences which are selective for different
surfaces differ from one another, or, conversely, which features sequences belonging to the
same surface have in common. We have used different approaches to address this problem.



FIG. 1: Relative frequency of sequences selective for different surfaces on the 5x5 lattice
(horizontal axis: one-dimensional surface structures). The black bars show the result for a
random sample of 100 000 sequences, the gray bars for a reduced set of 326 sequences based on a
"master sequence". Also shown are the values obtained with a least-squares fit to Eq. (3)
("Estimation"; see text for explanation).

FIG. 2: Histograms showing the distribution of Hamming distances for the 5x5 (left) and the
3x3x3 (right) lattice within sets of sequences belonging to several surfaces (lines). Also
shown for comparison are the results for a set of 5000 random sequences (crosses).



The first approach was motivated by the biological principle of mutation. A similarity measure
between two chain sequences can be defined by counting the minimum number of point mutations
required to construct one sequence, starting from the other. For our two-letter sequences, this
is quantified using the Hamming distance

    $$ d_H(s, s') = \frac{1}{2} \sum_i \left( 1 - s_i s_i' \right) \qquad (4) $$

between sequences $s$ and $s'$. The sum over $i$ runs over all monomers along the chain, and the
variables $s_i$, $s_i'$ are taken to be $s_i, s_i' = 1$ if the $i$th monomer of the respective
sequence is hydrophobic, and $s_i, s_i' = -1$ otherwise. Two sequences that have a Hamming
distance of $n$ are thus separated by $n$ point mutations. Since sequences can be read in both
directions, Eq. (4) usually yields two values for a pair of sequences. We have always used the
smaller one.
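The distance measure just described can be sketched as follows. Sequences are represented here as strings over the letters H and P, which is a choice of this example. The spin form with values +1 and -1 is one possible way of writing the same count, and the function names are hypothetical.

```python
def hamming(s, t):
    """Number of positions at which two equally long sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


def hamming_spin(s, t):
    """The same count written with s_i = +1 for H and s_i = -1 for P."""
    spin = {"H": 1, "P": -1}
    return sum(1 - spin[a] * spin[b] for a, b in zip(s, t)) // 2


def sequence_distance(s, t):
    """Smaller of the two Hamming distances obtained by reading t in both directions."""
    return min(hamming(s, t), hamming(s, t[::-1]))


if __name__ == "__main__":
    print(hamming("HHPP", "PPHH"))            # 4
    print(sequence_distance("HHPP", "PPHH"))  # 0, same chain read backwards
```

Taking the minimum over both reading directions reflects the fact that a chain and its reverse describe the same physical polymer.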

Based on this definition, we can now study whether sequences belonging to the same surface are
"close" in sequence space. Examples of distributions of Hamming distances for different
surfaces are shown in Fig. 2. The distributions for different surfaces, and even for different
system sizes, are very similar. The most frequent number of mutations is nearly half the total
number of monomers in the polymer chain. Moreover, the distribution is not very different from
that of a totally random set of sequences, which is also shown in Fig. 2 for comparison. Hence
we conclude that the sequences selective for a particular surface are widely distributed over
the sequence space, and that proximity in sequence space is not a relevant factor for molecular
recognition.
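The random baseline used for this comparison is easy to reproduce in outline. The sketch below draws pairs of uncorrelated random HP sequences and histograms their distances, taking the smaller value over both reading directions. The sample size, the seed, and the function names are arbitrary choices of this example, not values from the paper.

```python
import random
from collections import Counter


def sequence_distance(s, t):
    """Smaller Hamming distance over both reading directions of t."""
    forward = sum(a != b for a, b in zip(s, t))
    backward = sum(a != b for a, b in zip(s, reversed(t)))
    return min(forward, backward)


def random_distance_histogram(n_monomers, n_pairs, seed=0):
    """Histogram of distances between pairs of random sequences (H and P equally likely)."""
    rng = random.Random(seed)

    def draw():
        return "".join(rng.choice("HP") for _ in range(n_monomers))

    return Counter(sequence_distance(draw(), draw()) for _ in range(n_pairs))


if __name__ == "__main__":
    hist = random_distance_histogram(n_monomers=25, n_pairs=5000)
    for distance in sorted(hist):
        print(distance, hist[distance])
```

For random sequences each position differs with probability one half, so the histogram peaks close to half the chain length, slightly below it because the smaller of two values is kept.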

This result has interesting practical consequences. An important issue for many cell-surface
recognition processes is the question of how efficiently nature distinguishes between different
surfaces, i.e., how many mutations are required to change a polymer sequence that is selective
for a particular surface to make it selective for another surface or a whole class of different
surfaces. In our model, the observation that sequences selective for the same surface appear to
be widely spread in sequence space suggests that one might find sequences which are selective
for very different surfaces in close vicinity in sequence space.

In order to test this idea, we have attempted to compute subsets of sequences which are close
in sequence space and nevertheless "recognize" all surfaces, i.e., which contain at least one
selective sequence for each surface. Such sets were constructed following a two-step procedure.
First, we identified a center or master sequence, which was a suitable initial point for the
mutation process. This was done mainly by trial and error, starting from the sequences
belonging to the least favorable surfaces. Second, we evaluated the number of mutations
necessary to provide a subset of sequences recognizing all surfaces. This analysis was carried
out for different two-dimensional systems. The results are shown in Table I. In spite of the
exponential growth in the number of possible polymer chain conformations and possible sequence
realizations, the number of necessary mutations $r$ in Table I increases only slightly with the
surface size. The distribution of the sequences on the surfaces is shown for one of these
reduced subsets in Fig. 1 and can be compared with the full distribution. The general features
are comparable.
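The second step of this procedure can be sketched as the enumeration of all sequences within $r$ point mutations of a master sequence. That the subsets are full mutation neighborhoods of this kind is an assumption of this example, and the master sequence in the usage line is a placeholder, not one from the paper. The assumption is at least compatible with the numbers in the text: for 25 monomers and $r = 2$ the neighborhood contains 326 sequences, the size quoted for the reduced set on the 5x5 lattice.

```python
from itertools import combinations
from math import comb


def mutants(master, r):
    """Yield every HP sequence that differs from `master` in at most r positions."""
    flip = {"H": "P", "P": "H"}
    for k in range(r + 1):
        for positions in combinations(range(len(master)), k):
            seq = list(master)
            for i in positions:
                seq[i] = flip[seq[i]]
            yield "".join(seq)


def neighborhood_size(n_monomers, r):
    """Number of sequences within r point mutations: sum of binomial coefficients."""
    return sum(comb(n_monomers, k) for k in range(r + 1))


if __name__ == "__main__":
    master = "H" * 25  # placeholder master sequence
    print(len(set(mutants(master, 2))), neighborhood_size(25, 2))  # 326 326
```

The neighborhood grows only polynomially with chain length at fixed $r$, whereas the full sequence space grows as $2^N$.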

We note that the values $r$ for the minimum number of mutations required to recognize all
surfaces, as given in Table I, are upper limits and can possibly be reduced further with more
efficient master sequences. Even so, $r$ is in some cases smaller than the minimum number of
mutations necessary to generate all surfaces (starting from a common master surface). Hence
only a few point mutations can alter the adsorption characteristics profoundly. This result
agrees with experimental results obtained from binding force measurements on antibodies.
Experimentally, it was observed that the wild-type antibody and a mutant in which an amino acid
at one position in the chain has been exchanged differ in the measured affinity by roughly one
order of magnitude.

TABLE I: Number of mutations $r$ necessary to generate a subset of sequences which recognize
all surfaces, together with the corresponding subset size for various lattice sizes. In the
case of rectangular folding (5x4 and 6x5), the largest side forms the interface to the surface.

FIG. 3: Performance of the fully connected two-layer perceptron trained for several surface
structures on the 5x5 and 3x3x3 lattice, displayed in a sensitivity (true positive) versus
specificity (true negative) plot. The diagonal line represents results with a 50% correct
classification rate, corresponding to random guessing. For the 6x6 system, the results were
obtained by a fully connected three-layer perceptron with 16 hidden units. In all cases, the
data have been transformed to Fourier space and the perceptron was optimized via a
back-propagation algorithm.

We return to the problem of determining common features of sequences which are selective for
the same surface. To clarify whether any such features exist, we have applied an artificial
neural network (ANN). After training the ANN with a set composed in equal parts of selective
and nonselective sequences for a given surface, the performance of the ANN was tested with a
second, disjoint set. This analysis was performed for all surfaces with at least 100 selective
sequences. The results of the testing, Fig. 3, show that there do exist relevant features for
the recognition process that can be learned by the ANN.
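The two quantities used to report the test performance can be computed as in the sketch below. The definitions are the standard ones (true positive rate and true negative rate). The function name and the boolean encoding of labels are assumptions of this example, and the classifier itself is not reproduced here.

```python
def sensitivity_specificity(is_selective, predicted_selective):
    """Return (sensitivity, specificity) for paired true and predicted labels."""
    pairs = list(zip(is_selective, predicted_selective))
    tp = sum(1 for truth, guess in pairs if truth and guess)
    fn = sum(1 for truth, guess in pairs if truth and not guess)
    tn = sum(1 for truth, guess in pairs if not truth and not guess)
    fp = sum(1 for truth, guess in pairs if not truth and guess)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("the test set must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    truth = [True, True, True, True, False, False, False, False]
    guess = [True, True, True, False, False, False, True, True]
    print(sensitivity_specificity(truth, guess))  # (0.75, 0.5)
```

On a balanced test set, a classifier that guesses at random is expected to be correct for about half of the sequences, which is the reference case mentioned in the caption of Fig. 3.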

The next question is: What does the ANN learn? In the case of a two-layer perceptron, the
answer is relatively simple [22]: The ANN classifies by dividing the sequence space of
dimension $N$ into two parts by an $(N-1)$-dimensional hyperplane. The fact that this
classification is successful suggests that insight might be gained by a more general
characterization than the mere mutual (Hamming) distances. In order to achieve this, we applied
the "Principal Component Analysis" (PCA). In this approach, the data, i.e., in the present
paper the discrete Fourier transforms of the sequences, are treated as a random vector. In
general the modes are correlated, in particular if common features within a set of sequences
exist. This is characterized by the variances and covariances given in the covariance matrix.
Diagonalization of this matrix yields a description by uncorrelated components, the
eigenvectors. The eigenvalues are the variances of these components. Low eigenvalues correspond
to characteristic components within the set.
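The PCA step can be written compactly as follows. The notation ($x^{(\mu)}$, $M$, $C$, $v^{(a)}$, $\lambda_a$) is introduced here for illustration and is not taken from the paper. How the real and imaginary parts of the Fourier modes are arranged into a real vector, and the $1/M$ normalization, are likewise assumptions.

```latex
% Notation assumed for this sketch (not from the paper):
%   x^{(\mu)}  real feature vector of sequence \mu, built from the real and
%              imaginary parts of its discrete Fourier modes, \mu = 1, ..., M
\begin{align}
  \bar{x}_k &= \frac{1}{M} \sum_{\mu=1}^{M} x^{(\mu)}_k , \\
  C_{kl} &= \frac{1}{M} \sum_{\mu=1}^{M}
            \bigl( x^{(\mu)}_k - \bar{x}_k \bigr) \bigl( x^{(\mu)}_l - \bar{x}_l \bigr) , \\
  C \, v^{(a)} &= \lambda_a \, v^{(a)} , \qquad
  \lambda_a = \operatorname{Var}\bigl( v^{(a)} \cdot x \bigr)
  \quad \text{for } \lvert v^{(a)} \rvert = 1 .
\end{align}
```

A small eigenvalue $\lambda_a$ means that the projection $v^{(a)} \cdot x$ takes nearly the same value for every sequence in the set, which is why such components are characteristic of the set.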

We have carried out PCAs for various surfaces in the 5x5, the 6x6, and the 3x3x3 systems. The
results revealed an unexpected common feature: For all surfaces in the 5x5 and the 3x3x3
systems, two components turned out to be especially meaningful, namely almost exactly the
highest-frequency modes (real and imaginary part). The corresponding variances were
considerably smaller than those of all other components, see Fig. 4. In the 6x6 system, the
result was not as simple, yet the high-frequency components were still among the significant
components.

FIG. 4: (a) Square roots of the eigenvalues of the covariance matrix and (b) coordinates in
Fourier space (q-space) of the eigenvector corresponding to the smallest variance (circles in
a) for three surface patterns of the 3x3x3 system.

These results can be visualized by projecting the sequence space onto the highest-frequency
plane. Fig. 5 illustrates for the 3x3x3 system that sets of sequences belonging to different
surfaces often occupy different regions in this plane.

FIG. 5: Projection of sequences onto the highest-frequency ($\omega$) plane for the 3x3x3
lattice and various surface structures. Some of the sets are completely separated in this
plane.

To summarize, we have studied the recognition of chemically structured surfaces by single
polymer chains comprising hydrophilic and hydrophobic monomer units. Starting from already
folded conformations, we investigated distributions of selective sequences and the role of
point mutations. We found that sequences recognizing the same surface are widely distributed in
sequence space, i.e., they are separated by many mutations. Conversely, it was in many cases
possible to construct a subset of sequences which recognize all surfaces and nevertheless
differ from one another by only a few mutations. Despite their wide distribution, sequences
recognizing the same surface have features in common, which can be learned by a neural network.
One factor which turned out to be particularly important in this recognition process is the
local, small-scale structure on the polymers.

We thank Alexey Polotsky for useful discussions and the German Science Foundation (DFG) for
partial support.